added snakemake-like Rule #638

Open · wants to merge 2 commits into master

Conversation

jonathanBieler

This adds a way to declare rules that bind input and output files and rerun them if the files are outdated or missing, a bit like in Snakemake (although quite basic). I've toyed with more advanced ways of (re-)triggering the computation, but I think that would require much more complicated machinery (a CLI, caching of runs, hashing of the code, etc.), so here I just added a way to manually retrigger a computation with a keyword.

If there's interest in this I can add docs.

cf. https://discourse.julialang.org/t/dagger-dates-snakemake/126111

Example:

using CSV, DataFrames, Dagger, Statistics

## prepare data

dir = mktempdir()

mean_squared_input = Float64[]
for sample_idx in 1:5
   x = rand(10)
   CSV.write("$(dir)/sample_$(sample_idx).csv", DataFrame(x=x))
   push!(mean_squared_input, mean(x.^2))
end

samples = ["$(dir)/sample_$(sample_idx).csv" for sample_idx in 1:5]

## define rules 

# function that creates a Rule for a given sample
get_rule_square(sample) = Dagger.Rule(sample => replace(sample, "sample_" => "sample_squared_"); forcerun=false) do input, output
   df = CSV.read(input[1], DataFrame)
   df.xsquared = df.x .^ 2
   CSV.write(output[1], df)
   output
end

squared_rules = get_rule_square.(samples)
squared_rule_outputs = [only(r.outputs) for r in squared_rules]

make_summary = Dagger.Rule(squared_rule_outputs => "$(dir)/samples_summary.csv"; forcerun=false) do inputs, output
   dfs = CSV.read.(inputs, DataFrame)
   mean_squared = DataFrame(sample = inputs, mean_squared = [mean(df.xsquared) for df in dfs])
   CSV.write(output[1], mean_squared)
   output
end

## Run 

squared = [Dagger.@spawn r() for r in squared_rules]
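# passing the spawned per-sample tasks as arguments makes the summary rule run after them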
summary_file = Dagger.@spawn make_summary(squared...)

out = CSV.read(fetch(summary_file), DataFrame)

@assert out.mean_squared == mean_squared_input

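# touching one of the summary rule's input files after the summary was written marks it stale again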
@assert Dagger.needs_update(make_summary) == false
run(`touch $(squared_rule_outputs[1])`)
sleep(0.5)
@assert Dagger.needs_update(make_summary) == true

run(`rm $(squared_rule_outputs[1])`)
summary_file = Dagger.@spawn make_summary(squared...) # fails

squared = [Dagger.@spawn r() for r in squared_rules] # redo only 1 file

summary_file = Dagger.@spawn make_summary(squared...)  
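
A minimal sketch of the manual retrigger via the keyword mentioned above (assuming forcerun=true forces the body to run even when the output is up to date; forced_rule is just an illustrative name):

## force a rerun of the first sample's rule even though its output already exists
forced_rule = Dagger.Rule(samples[1] => replace(samples[1], "sample_" => "sample_squared_"); forcerun=true) do input, output
   df = CSV.read(input[1], DataFrame)
   df.xsquared = df.x .^ 2
   CSV.write(output[1], df)
   output
end
fetch(Dagger.@spawn forced_rule())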

@jpsamaroo
Member

Very cool! Right now this API feels a bit cumbersome to me, so I'd like us to think on ways to make things feel "smoother" and more automatic (not that I really know what that means right now). Something that would be really nice is if this would integrate with Dagger.File and Dagger.tofile, which are used for lazy-loading and saving of files, respectively. Again, not sure what that would look like, but I'm open to ideas 😄
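
For reference, this is roughly how Dagger.File and Dagger.tofile are used today (a sketch from memory of the Dagger docs, so treat the exact signatures as approximate):

using Dagger

path = joinpath(mktempdir(), "data.jls")

# write a value to disk and get back a lazy file reference
file = Dagger.tofile([1, 2, 3], path)

# the data is only loaded where a task actually uses it
@assert fetch(Dagger.@spawn sum(file)) == 6

# wrap an already-existing file for lazy loading
@assert fetch(Dagger.@spawn sum(Dagger.File(path))) == 6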

@jonathanBieler
Author

Well, that's the API right now:

Dagger.Rule(
    user_function,
    input_path => output_path
)

I'm not sure it can be much simpler. Maybe if input_path/output_path aren't defined it could default to some tofile-based caching? Often you start with raw data and end up with some "publishable" outputs (plots, a report, ...), but you might not care much about the intermediate steps, so it would be nicer if they could be managed automatically.

But I agree the whole thing is a bit cumbersome. One issue is that in Snakemake you define the rules using files as inputs/outputs and Snakemake then builds and executes the graph for you. Here you have to define the rules and still build the graph manually, spawning things in the right order with the right arguments.

Another issue is that if you modify the user_function the rule won't rerun, since only the input/output dates are checked and not the code, so you can get wrong results if you're not careful.
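
One manual workaround (just a sketch reusing the Rule call from the example above, not something this PR automates): bake a version tag into the output paths and bump it whenever the function changes, so the old outputs are treated as missing.

RULE_VERSION = "v2"  # bump this whenever the rule body changes

get_rule_square(sample) = Dagger.Rule(sample => replace(sample, "sample_" => "sample_squared_$(RULE_VERSION)_"); forcerun=false) do input, output
   df = CSV.read(input[1], DataFrame)
   df.xsquared = df.x .^ 2
   CSV.write(output[1], df)
   output
end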

My intuition is that you either have to go with a bare-bones design (what I've tried to do) and let the user do most of the work, or go all in with caching, a CLI, etc. like Snakemake/Nextflow (which would be a separate package, I think), and that anything in between is a bit awkward.
